Before you begin this worksheet this week, please reinstall
DigitalMethodsData from GitHub by running:
devtools::install_github("regan008/DigitalMethodsData") in
your console. Also be sure that you have installed the Tidyverse
library.
R has powerful tools for manipulating data. The Tidyverse is a collection of packages for R that are designed for data science. Take a look at the website for the Tidyverse and the list of packages that are included at: https://www.tidyverse.org/packages/
dplyr()We’ll start with dplyr which is described as “a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.” The verbs included in this package are:
select(): picks variables based on their names.mutate(): adds new variables that are functions of
existing variables.filter(): picks cases based on their values.summarise(): reduces multiple values down to a single
summary.arrange(): changes the ordering of the rows.All of these verbs play nicely and combine naturally with
group_by() which allows you to perform any operation “by
group”.
Lets load some data and libraries for our work.
library(DigitalMethodsData)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(magrittr)
data("gayguides")
Lets start with select(). This function allows you to
subset columns using their names and types.
gayguides %>%
select(title, Year)
Notice that this subsetted the data and returned only the title and
year. However, it didn’t modify the gayguides data or save
it to a new variable because we didn’t assign the result to
anything.
select() to take the city and state from gayguides
and add them to a dataframe called “locations”.locations <- gayguides %>%
select(city, state)
I assigned the results of the dplyr function to a variable.
select() to grab all the columns of
gayguides EXCEPT for the city and state? Hint: You might
want to read the documentation for this function.gayguides %>%
select(-c(city, state))
The filter function subsets a data frame and retains all the rows that satisfy your conditions. To be retained, the row must produce a value of TRUE for all of the conditions you provide.
gayguides %>% filter(Year > 1980)
Filter also works with the logical values we learned earlier this semester.
gayguides %>% filter(Year == 1970 | Year == 1980)
And strings:
gayguides %>%
filter(city == "Greenville")
gayguides %>%
filter(city == "Greenville" & state == "SC")
gayguides %>%
filter(Year > "1975" & Year < "1980" )
gayguides %>%
filter(city == "Greenville" & state == "SC") %>%
filter(Year > "1975" & Year < "1980" )
gayguides %>%
filter(city != "New York" & city != "San Francisco") %>%
filter(Year == "1975")
gayguides %>%
filter(amenityfeatures == "(G)" | amenityfeatures == "(L)")
#I can't quite figure out the rest of the code, so that it includes observations where (G) and (L) appear together and/or in combination with other categorizations.
The mutate() function adds new variables and preserves
existing one. This is useful when you want to create a new column based
on other values. For example, in the statepopulation
dataset, we want to ask “How much did the population increase between
1800 and 1900 in each state?.” We can do that by subtracting the
population in 1900 from 1800 and storing that value in a new column.
data("statepopulations")
statepopulations %>% mutate(difference = X1900 - X1800)
#BostonWomenVoters %>%
#mutate(birth_year = 1920 - as.numeric(a))
I hashtagged the above code because I kept getting an error code for it. I’m not entirely sure what I’ve done wrong, as the code generated the answer I was looking for.
gayguides into a new column called location? It should
list the city, state. (i.e. San Diego, CA)gayguides %>%
mutate(location = paste(city, ",", state))
Arrange() orders the rows of a data frame by the values
of selected columns. In other words it sorts a data frame by a variable.
In the gayguides data, we can sort the data by year with
the earliest year first. If we wanted the latest year first, we could do
so by using the desc() function.
gayguides %>%
arrange(Year)
gayguides %>%
arrange(desc(Year))
statepopulation data, which state has the
largest population in 1850? Write code that pulls only the relevant
columns (state and 1850) and sorts it accordingly.statepopulations %>%
arrange(desc(X1850)) %>%
select(STATE, X1850)
Arrange is useful for finding the highest and lowest values, but it
returns those values for the entire dataset. group_by(), in
contrast, takes an existing tbl and converts it into a grouped tbl where
operations are performed “by group”. Lets look at what that means in
practice:
mydata <- gayguides %>%
select(title, Year) %>%
group_by(Year)
It doesn’t appear that this did much. But if you hover over this new
variable in your environment pane, you’ll see that its now listed as a
“grouped data frame.” Compare that to gayguides which is
listed as just a data frame. This means that now we can run calculations
on this data and it’ll perform them “by group”. Or, in other words,
it’ll perform operations on each year within the dataset. That’s where
summarize() comes in. summarize() creates a
new data frame with one (or more) rows for each combination of grouping
variables. In this case our grouping is by year, so the resulting data
frame will group records by each year in the gayguides
dataset.
gayguides %>%
select(title, Year) %>%
group_by(Year) %>%
summarize(count = n())
What happened here? In this example, we asked group_by to create groups based on year and then in summarize we created a column called count. We passed it the n() function which gives the current group size. What results, is a dataset that lists each year and how many locations that state has.
gay_guides_summary <- gayguides %>%
select(title, state, Year) %>%
group_by(state, Year)
gay_guides_summary %>%
summarize(count = n())
## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.
summarize() to find the average age for each occupation in
the Boston Women Voters data?#Boston_occupations <- BostonWomenVoters %>%
#select(Age, Occupation) %>%
#group_by(Occupation)
#Boston_occupations %>%
#summarise(Avg = mean(as.numeric(Age)))
#I think I may have mis-stepped somewhere in this code!
Thsi is another question for which I kept getting an error code, despite the code I wrote giving me the answer I sought.
gayguides data, on average how many locations
did each city in South Carolina have between 1970 and 1975?gg_avg_locations <- gayguides %>%
select(city, state, Year, title) %>%
filter(state == "SC") %>%
group_by(city)
gg_avg_locations %>%
filter(Year > 1970 | Year < 1975) %>%
summarise(count = n()) %>%
summarise(Avg = mean(count))
#I think I may have narrowed down the data too much here.
southernstates <- c("AL", "AR", "FL", "GA", "KY", "LA", "MD", "MS", "NC", "SC", "TN", "TX", "VI", "WV")
gayguides %>%
select(title, state, Year) %>%
filter(Year == "1975", state == southernstates) %>%
group_by(state) %>%
summarise(count = n())
## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `state == southernstates`.
## Caused by warning in `state == southernstates`:
## ! longer object length is not a multiple of shorter object length
At some point, you might have a situation where you want to join two
tables together. For example, in the almshouse_admissions
dataset there is a column called “Occupational Code”.
data("almshouse_admissions")
head(almshouse_admissions$Descriptions.by.Clerk)
## [1] 10 50 6 3 6 0
For the purposes of working with this data in R, having only the code isn’t very useful. The code book for these occupations is available here:
almshouse.occupations <- read.csv(file="https://raw.githubusercontent.com/regan008/DigitalMethodsData/main/raw/almshouse-occupationalcodes.csv", header=TRUE)
A join allows us to join these two dataframes together, matching each
row based on the occupational code provided in the
Descriptions.by.Clerk column. To do that we’ll use a
function known as a mutating join. A mutating join allows you to combine
variables from two tables. It first matches observations by their keys,
then copies across variables from one table to the other. In this case
we want to join the matching rows from
almshouse.occupations to almshouse_admissions.
In an ideal world, the column names in the two data frames would match
but since that isn’t the case, we’ll have to specify what columns
left_join should use to join the two data frames.
almshouse_admissions <- left_join(almshouse_admissions, almshouse.occupations, by=c("Descriptions.by.Clerk"="code"))
head(almshouse_admissions)
gayguides. Create a data frame that
includes each of the regions and the total number of locations in 1980.
How many locations appear in the Mountain region in 1980?regions <- read.csv("https://raw.githubusercontent.com/regan008/DigitalMethodsData/main/raw/censusregions.csv")
gayguides <- left_join(gayguides, regions, by=c("state"="State.Code"))
Revised_gayguides <- gayguides %>%
select(title, Year, Region, Division)
Revised_gayguides %>%
filter(Year == "1980", Division == "Mountain")
There are a total of 285 observations in the new dataset that include the specified year and region.
I joined using the “State.Code” variable in the “regions” dataset to pair with the states in the “gayguides” dataset. From there, I simply filtered the new dataset based on the year (1980) and region (Mountain). The result shows exactly 285 observations that fit that criteria.
(@)How much did LGTBQ life grow between 1970 and 1980? Can you create a data frame that computes the growth in the number of locations between 1970 and 1980 for every state? For every region?
gay_growth <- gayguides %>%
select(state, Year, title, Region) %>%
filter(Year == "1970" | Year == "1980") %>%
group_by(state, Region)
gay_growth %>%
summarise(count = n())
## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.
pivot_longer() and pivot_wider():
Converting Wide and Long DataIt’s possible that you won’t create every dataset you use in R. Sometimes that means the dataset is in a format that isn’t useful for the questions you want to ask. The dataset below is what is referred to as a “wide” data frame. That is in comparison to a “long” data frame (which would be considered tidy data).
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
sc.parks <- read.csv("https://raw.githubusercontent.com/regan008/DigitalMethodsData/main/raw/RecreationData-Wide.csv")
head(sc.parks)
This dataset contains all of the localities in South Carolina along
with information about the types of recreational workers in that city
(paid vs unpaid, male vs female). However, the problem with this dataset
is that every year is a column heading making it difficult to work with.
On the surface this seems like a useful format, partially because it
reads left to right which is how we’re accustomed to reading documents.
Its easy to compare, for example, the number of female paid recreation
workers between 1930 and 1945. But for computational purposes this
format is less than ideal for many types of visualizations and
operations. R provides functions for dealing with this.
pivot_longer() “lengthens” your data by increasing the
number of rows and decreasing the number of columns.
sc.parks <- sc.parks %>%
pivot_longer(!city:type_of_worker, names_to = "year", values_to = "count")
What did this code do? >It grouped all of the years together in one column, and created another column containing the counts for the values given in the “type_of_worker” column.
Here’s another wide data frame. Can you turn this from a wide to a narrow data frame?
rec.spaces <- read.csv("https://raw.githubusercontent.com/regan008/DigitalMethodsData/main/raw/PlayAreabyType.csv")
rec.spaces <- rec.spaces %>%
pivot_longer(!type, names_to = "year", values_to = "count")
The opposite of pivot_longer() is
pivot_wider(). It “widens” data by increasing the number of
columns and decreasing the number of rows. We can revert
sc.parks back to a wide dataset using this function.
sc.parks %>%
pivot_wider(names_from = year, values_from = count)
sc.parks dataset so that the column names are
drawn from the type of recreation worker.sc.parks <- sc.parks %>%
pivot_wider(names_from = type_of_worker, values_from = count)
rec.spaces into a wide dataframe.rec.spaces <- rec.spaces %>%
pivot_wider(names_from = type, values_from = count)
Each of the functions covered in this worksheet are valuable tools for manipulating datasets. But they are more powerful when combined. When using them to pair down a dataset, we are asking and answering a question. For example in this code from earlier in our worksheet:
gayguides %>%
select(title, Year) %>%
group_by(Year) %>%
summarize(count = n())
The implicit question was, “How many locations appear in each year?”.
The judges dataset in provided in the DigitalMethodsData
package is a large, messy, wide dataframe that contains a lot of
information. Look at this dataframe and then compose a question to ask
of the data.
data("judges")
#I would like to know how many SC-born judges in this dataset were appointed by presidents from the Democratic Party.
#First, I identify the variables that will guide my search. In this case, those are "Birth.State" and "Party.of.Appointing.President."
#In addition to these, I narrow down the number of columns I ultimately want to pull, so that I just get the data I want. In addition to the aforementioned variables, I want to pull "Last.Name," "First.Name," "Appointing.President," and "Confirmation.Date."
#From there, I will filter based on the values I've specified for birth state and party affiliation.
SC_judges <- judges %>%
select(Last.Name, First.Name, Birth.State, Appointing.President..1., Party.of.Appointing.President..1., Confirmation.Date..1.) %>%
filter(Birth.State == "SC", Party.of.Appointing.President..1. == "Democratic")
head(SC_judges)
gayguides data (or another
dataset of your choice). What is the question you are asking?#I want to see how many entries in this dataset are bookstores in the states of CA and TX, so that I can contrast the results.
#I'm selecting variables based only on what I consider the most necessary information in this dataset. From there, I'm using the filter function to pull data for the values I specify in my research question. I'm then using the count function to determine how many California bookstores are in this dataset, compared to the number for Texas.
gayguides_books <- gayguides %>%
select(title, type, city, state, Year, Region) %>%
filter(type == "Book Store", state == c("CA", "TX"))
gayguides_books %>%
count(state == "CA")
#I can now see that there are 187 observations for bookstores in California in this dataset, compared to just 84 for the state of Texas.
gayguides %>%
select(title, type, state, Year) %>%
filter(Year == "1983", state == "NY")